process-based model
Biogeochemistry-Informed Neural Network (BINN) for Improving Accuracy of Model Prediction and Scientific Understanding of Soil Organic Carbon
Xu, Haodi, Fan, Joshua, Tao, Feng, Jiang, Lifen, You, Fengqi, Houlton, Benjamin Z., Sun, Ying, Gomes, Carla P., Luo, Yiqi
Big data and the rapid development of artificial intelligence (AI) provide unprecedented opportunities to enhance our understanding of the global carbon cycle and other biogeochemical processes. However, retrieving mechanistic knowledge from big data remains a challenge. Here, we develop a Biogeochemistry-Informed Neural Network (BINN) that seamlessly integrates a vectorized process-based soil carbon cycle model (i.e., Community Land Model version 5, CLM5) into a neural network (NN) structure to examine mechanisms governing soil organic carbon (SOC) storage from big data. BINN demonstrates high accuracy in retrieving biogeochemical parameter values from synthetic data in a parameter recovery experiment. We use BINN to predict six major processes regulating the soil carbon cycle (or components in process-based models) from 25,925 observed SOC profiles across the conterminous US and compare them with the same processes previously retrieved by a Bayesian inference-based PROcess-guided deep learning and DAta-driven modeling (PRODA) approach (Tao et al. 2020; 2023). The high agreement between the spatial patterns of the processes retrieved by the two approaches, with an average correlation coefficient of 0.81, confirms BINN's ability to retrieve mechanistic knowledge from big data. Additionally, the integration of neural networks and process-based models in BINN improves computational efficiency by more than 50 times over PRODA. We conclude that BINN is a transformative tool that harnesses the power of both AI and process-based modeling, facilitating new scientific discoveries while improving interpretability and accuracy of Earth system models.
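The idea of a parameter recovery experiment can be illustrated with a toy example. The sketch below is not BINN or CLM5: it assumes a hypothetical one-pool soil carbon model (dC/dt = I - k*C, with steady state C = I/k), generates synthetic SOC values from a known decay rate, and recovers that rate by gradient descent. The function names, the input values, and the single global parameter are all assumptions made for illustration; in BINN a neural network predicts the parameters from site covariates.

```python
import math


def steady_state_soc(carbon_input, log_k):
    """Toy one-pool model (assumed): dC/dt = I - k*C, so steady-state C = I / k."""
    return carbon_input / math.exp(log_k)


def recover_log_k(inputs, observed, log_k=0.0, lr=0.1, steps=500):
    """Fit log(k) by gradient descent on the mean squared error in log space."""
    n = len(inputs)
    for _ in range(steps):
        grad = 0.0
        for carbon_input, obs in zip(inputs, observed):
            # log(prediction) = log(I) - log_k, so d(resid)/d(log_k) = -1
            resid = math.log(carbon_input) - log_k - math.log(obs)
            grad += -2.0 * resid
        log_k -= lr * grad / n
    return log_k


# Synthetic "observations" generated from a known (hypothetical) decay rate.
true_k = 0.02
inputs = [0.2, 0.5, 1.0]
observed = [steady_state_soc(i, math.log(true_k)) for i in inputs]

recovered_k = math.exp(recover_log_k(inputs, observed))
print(f"true k = {true_k}, recovered k = {recovered_k:.6f}")
```

Fitting log(k) rather than k keeps the decay rate positive without an explicit constraint, a common choice when process parameters must stay within physical bounds.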
Hybrid Phenology Modeling for Predicting Temperature Effects on Tree Dormancy
van Bree, Ron, Marcos, Diego, Athanasiadis, Ioannis
Biophysical models offer valuable insights into climate-phenology relationships in both natural and agricultural settings. However, there are substantial structural discrepancies across models which require site-specific recalibration, often yielding inconsistent predictions under similar climate scenarios. Machine learning methods offer data-driven solutions, but often lack interpretability and alignment with existing knowledge. We present a phenology model describing dormancy in fruit trees, integrating conventional biophysical models with a neural network to address their structural disparities. We evaluate our hybrid model in an extensive case study predicting cherry tree phenology in Japan, South Korea and Switzerland. Our approach consistently outperforms both traditional biophysical and machine learning models in predicting blooming dates across years. Additionally, the neural network's adaptability facilitates parameter learning for specific tree varieties, enabling robust generalization to new sites without site-specific recalibration. This hybrid model leverages both biophysical constraints and data-driven flexibility, offering a promising avenue for accurate and interpretable phenology modeling.
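For readers unfamiliar with biophysical dormancy models, the following is a minimal sketch of a generic sequential chill-then-forcing model. It is not the authors' model: the function name, thresholds, and requirements are hypothetical values chosen for illustration. A hybrid approach of the kind described above might let a neural network supply or replace parts of such a structure, while the accumulation logic stays mechanistic.

```python
def predict_bloom_day(daily_temps, chill_threshold=7.0, chill_required=30.0,
                      base_temp=5.0, forcing_required=150.0):
    """Return the index of the predicted bloom day, or None if it is never reached.

    All thresholds are hypothetical illustration values, not calibrated parameters.
    """
    chill = 0.0
    forcing = 0.0
    for day, temp in enumerate(daily_temps):
        if chill < chill_required:
            # Dormancy phase: one chill unit per sufficiently cold day.
            if temp < chill_threshold:
                chill += 1.0
        else:
            # Growth phase: accumulate degree days above the base temperature.
            forcing += max(0.0, temp - base_temp)
            if forcing >= forcing_required:
                return day
    return None


# Synthetic season: 30 cold days followed by 20 warm days.
season = [2.0] * 30 + [15.0] * 20
bloom_day = predict_bloom_day(season)
print("predicted bloom day index:", bloom_day)
```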
Deep learning meets tree phenology modeling: PhenoFormer vs. process-based models
Garnot, Vivien Sainte Fare, Spafford, Lynsay, Lever, Jelle, Sigg, Christian, Pietragalla, Barbara, Vitasse, Yann, Gessler, Arthur, Wegner, Jan Dirk
Phenology, the timing of cyclical plant life events such as leaf emergence and coloration, is crucial in the bio-climatic system. Climate change drives shifts in these phenological events, impacting ecosystems and the climate itself. Accurate phenology models are essential to predict the occurrence of these phases under changing climatic conditions. Existing methods include hypothesis-driven process models and data-driven statistical approaches. Process models account for dormancy stages and various phenology drivers, while statistical models typically rely on linear or traditional machine learning techniques. Research shows that process models often outperform statistical methods when predicting under climate conditions outside historical ranges, especially with climate change scenarios. However, deep learning approaches remain underexplored in climate phenology modeling. We introduce PhenoFormer, a neural architecture better suited than traditional statistical methods to predicting phenology under shifts in climate data distribution, while also significantly improving on, or performing on par with, the best-performing process-based models. Our numerical experiments on a 70-year dataset of 70,000 phenological observations from 9 woody species in Switzerland show that PhenoFormer outperforms traditional machine learning methods by an average of 13% R2 and 1.1 days RMSE for spring phenology, and 11% R2 and 0.7 days RMSE for autumn phenology, while matching or exceeding the best process-based models. Our results demonstrate that deep learning has the potential to be a valuable methodological tool for accurate climate-phenology prediction, and our PhenoFormer is a promising first step in improving phenological predictions before a complete understanding of the underlying physiological mechanisms is available.
UFLUX v2.0: A Process-Informed Machine Learning Framework for Efficient and Explainable Modelling of Terrestrial Carbon Uptake
Dong, Wenquan, Zhu, Songyan, Xu, Jian, Ryan, Casey M., Chen, Man, Zeng, Jingya, Yu, Hao, Cao, Congfeng, Shi, Jiancheng
Gross Primary Productivity (GPP), the amount of carbon fixed by plants through photosynthesis, is pivotal for understanding the global carbon cycle and ecosystem functioning. Process-based models built on the knowledge of ecological processes are susceptible to biases stemming from their assumptions and approximations. These limitations potentially result in considerable uncertainties in global GPP estimation, which may pose significant challenges to our Net Zero goals. This study presents UFLUX v2.0, a process-informed model that integrates state-of-the-art ecological knowledge and advanced machine learning techniques to reduce uncertainties in GPP estimation by learning the biases between process-based models and eddy covariance (EC) measurements. In our findings, UFLUX v2.0 demonstrated a substantial improvement in model accuracy, achieving an R^2 of 0.79 with a reduced RMSE of 1.60 g C m^-2 d^-1, compared to the process-based model's R^2 of 0.51 and RMSE of 3.09 g C m^-2 d^-1. Our global GPP distribution analysis indicates that while UFLUX v2.0 and the process-based model achieved similar global total GPP (137.47 Pg C and 132.23 Pg C, respectively), they exhibited large differences in spatial distribution, particularly in latitudinal gradients. These differences are very likely due to systematic biases in the process-based model and differing sensitivities to climate and environmental conditions. This study offers improved adaptability for GPP modelling across diverse ecosystems, and further enhances our understanding of the global carbon cycle and its responses to environmental changes.
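The general pattern of learning the bias between a process-based model and measurements can be sketched as residual correction. The example below is not UFLUX v2.0: it assumes a single hypothetical covariate and substitutes ordinary least squares for the machine learning model, with made-up numbers in place of process-model output and eddy covariance data.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bias_corrected(process_gpp, covariate, observed_gpp):
    """Learn the process-model residual from a covariate and add it back."""
    residuals = [obs - sim for obs, sim in zip(observed_gpp, process_gpp)]
    slope, intercept = fit_linear(covariate, residuals)
    return [sim + slope * c + intercept for sim, c in zip(process_gpp, covariate)]


# Hypothetical values: process-model GPP, a covariate (e.g. temperature),
# and "observations" that differ from the model by a covariate-dependent bias.
process_gpp = [2.0, 3.5, 5.0, 6.5]
covariate = [5.0, 10.0, 15.0, 20.0]
observed_gpp = [p + 0.1 * c + 0.5 for p, c in zip(process_gpp, covariate)]

corrected = bias_corrected(process_gpp, covariate, observed_gpp)
print([round(v, 3) for v in corrected])
```

Because the correction is added to the process-model output rather than replacing it, the mechanistic estimate remains the baseline and the learned component only accounts for the systematic mismatch.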
Knowledge-guided Machine Learning: Current Trends and Future Prospects
Karpatne, Anuj, Jia, Xiaowei, Kumar, Vipin
This is especially true in environmental sciences that are rapidly transitioning from being data-poor to data-rich, e.g., with the ever-increasing volumes of environmental data being collected by Earth observing satellites, in-situ sensors, and those generated by model simulations (e.g., climate model runs [113]). Similar to how recent developments in ML have transformed how we interact with the information on the Internet, it is befitting to ask how ML advances can enable Earth system scientists to transform a fundamental goal in science, which is to build better models of physical, biological, and environmental systems. The conventional approach for modeling relationships between input drivers and response variables is to use process-based models rooted in scientific equations. Despite their ability to leverage the mechanistic understanding of scientific phenomena, process-based models suffer from several shortcomings limiting their adoption in complex real-world settings, e.g., due to imperfections in model formulations (or modeling bias), incorrect choices of parameter values in equations, and high computational costs in running high-fidelity simulations. In response to these challenges, ML methods offer a promising alternative to capture statistical relationships between inputs and outputs directly from data. However, "black-box" ML models that rely solely on the supervision contained in data show limited generalizability in scientific problems, especially when applied to out-of-distribution data. One of the reasons for this lack of generalizability is the limited scale of data in scientific disciplines in contrast to mainstream applications of AI and ML where large-scale datasets in computer vision and natural language modeling have been instrumental in the success of state-of-the-art AI/ML models.
Another fundamental deficiency in black-box ML models is their tendency to produce results that are inconsistent with existing scientific theories and their inability to provide a mechanistic understanding of discovered patterns and relationships from data, limiting their usefulness in science.
Time Series Predictions in Unmonitored Sites: A Survey of Machine Learning Techniques in Water Resources
Willard, Jared D., Varadharajan, Charuleka, Jia, Xiaowei, Kumar, Vipin
Prediction of dynamic environmental variables in unmonitored sites remains a long-standing challenge for water resources science. The majority of the world's freshwater resources have inadequate monitoring of critical environmental variables needed for management. Yet, the need to have widespread predictions of hydrological variables such as river flow and water quality has become increasingly urgent due to climate and land use change over the past decades, and their associated impacts on water resources. Modern machine learning methods increasingly outperform their process-based and empirical model counterparts for hydrologic time series prediction with their ability to extract information from large, diverse data sets. We review relevant state-of-the-art applications of machine learning for streamflow, water quality, and other water resources prediction and discuss opportunities to improve the use of machine learning with emerging methods for incorporating watershed characteristics into deep learning models, transfer learning, and incorporating process knowledge into machine learning models. The analysis here suggests most prior efforts have been focused on deep learning frameworks built on many sites for predictions at daily time scales in the United States, but that comparisons between different classes of machine learning methods are few and inadequate. We identify several open questions for time series predictions in unmonitored sites that include incorporating dynamic inputs and site characteristics, mechanistic understanding and spatial context, and explainable AI techniques in modern machine learning frameworks.
Differentiable, learnable, regionalized process-based models with physical outputs can approach state-of-the-art hydrologic prediction accuracy
Feng, Dapeng, Liu, Jiangtao, Lawson, Kathryn, Shen, Chaopeng
Predictions of hydrologic variables across the entire water cycle have significant value for water resource management as well as downstream applications such as ecosystem and water quality modeling. Recently, purely data-driven deep learning models like long short-term memory (LSTM) showed seemingly insurmountable performance in modeling rainfall-runoff and other geoscientific variables, yet they cannot predict untrained physical variables and remain challenging to interpret. Here we show that differentiable, learnable, process-based models (called δ models here) can approach the performance level of LSTM for the intensively observed variable (streamflow) with regionalized parameterization. We use a simple hydrologic model HBV as the backbone and use embedded neural networks, which can only be trained in a differentiable programming framework, to parameterize, enhance, or replace the process-based model modules. Without using an ensemble or post-processor, δ models can obtain a median Nash-Sutcliffe efficiency of 0.732 for 671 basins across the USA for the Daymet forcing dataset, compared to 0.748 from a state-of-the-art LSTM model with the same setup. For another forcing dataset, the difference is even smaller: 0.715 vs. 0.722. Meanwhile, the resulting learnable process-based models can output a full set of untrained variables, e.g., soil and groundwater storage, snowpack, evapotranspiration, and baseflow, and later be constrained by their observations. Both simulated evapotranspiration and fraction of discharge from baseflow agreed decently with alternative estimates. The general framework can work with models with various process complexity and opens up the path for learning physics from big data.
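The Nash-Sutcliffe efficiency (NSE) reported above compares a model's squared error with that of a baseline that always predicts the observed mean: NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2). A value of 1 is a perfect fit and 0 means the model does no better than the mean. The sketch below computes it for a single basin; the streamflow values are made up for illustration, and the paper's median would be taken over per-basin values like this one.

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - error / variance


# Hypothetical daily streamflow for one basin (arbitrary units).
observed = [1.0, 2.0, 4.0, 3.0, 2.0]
simulated = [1.2, 1.8, 3.5, 3.1, 2.2]

print(f"NSE = {nash_sutcliffe(observed, simulated):.3f}")
```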
Deep Learning: A Next-Generation Big-Data Approach for Hydrology - Eos
In popular culture, Artificial Intelligence (AI) often refers to machines that can perform any intellectual task that humans can. Such machines are heavily romanticized and are still very far from becoming a reality. However, weak (or narrow) AIs, algorithms that are designed to perform a specific task, have shown a formidable intellectual prowess that surpasses human capabilities in certain tasks. These machines must have integrative decision-making capability based on what they receive and what they predict would happen. Take, for example, AlphaGo, the AI that famously defeated world champions at the ancient game "Go."